A Brief Introduction to Snakemake

Eric C. Anderson

NMFS Unix, Sedna, and Bioinformatics Workshop, Wednesday October 19, 2022

Setting up our workspace!

  • I have a section in the course repository called Snakemake-Example, which will be our current working directory for the rest of today.
  • Also, you must pull down the latest changes to the repo.
Paste this in to your shell
cd ~/nmfs-bioinf-2022/Snakemake-Example
git pull origin main

Then, if you are not already running an interactive session with two cores, do:

Only do this if you aren't already in an interactive session
srun -c 2 -t 03:00:00 --pty /bin/bash
  • Snakemake relies on the conda and mamba package managers. Giles has set up a system-wide conda installation that works beautifully.
  • Here we test to see if you already have conda:
Paste this in to your shell
which conda

If the response looks something like this:

/usr/bin/which: no conda in (/home/eanderson/.local/bin:/home/eanderson/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin)

Then you need to add conda initialization to your ~/.bashrc like this:

Paste into your shell if you do not already have conda
echo 'eval "$(/opt/bioinformatics/miniconda3/bin/conda shell.bash hook)"' >> ~/.bashrc
source ~/.bashrc

If you already have conda then when you did which conda you might have gotten something like

/Users/username/miniconda3/bin/conda

OR:

conda ()
{
    \local cmd="${1-__missing__}";
    case "$cmd" in
        activate | deactivate)
            __conda_activate "$@"
        ;;
        install | update | upgrade | remove | uninstall)
            __conda_exe "$@" || \return;
            __conda_reactivate
        ;;
        *)
            __conda_exe "$@"
        ;;
    esac
}

In those cases, you should not need to add any lines to your .bashrc.

What the Heck is Snakemake?

  • A Python-based “Workflow Management System”
  • Allows you to define a complex (bioinformatic) workflow as a series of steps that involve input files and output files.
  • It identifies the dependencies between the steps and then runs all the steps needed to create a requested output file.
  • This greatly simplifies the orchestration of bioinformatics, and makes it much easier to find and re-run failed jobs.
  • Incredibly valuable for reproducible research:
    • Not just so others can reproduce your results
    • Also useful for you to quickly run your workflow on different clusters, etc.

That sounds pretty jargony!

  • Illustrate with an example
  • Hope that it piques the curiosity of some

Our Small Example: GATK Best Practices “Light”

flowchart TD
  A(fastq files from 3 samples: our raw data) --> B(Trim the reads: trimmomatic)
  B --> C(Map the reads to a reference genome: bwa mem)
  C --> D(Mark PCR and optical duplicates: MarkDuplicates)
  D --> E(Make gVCF files for each sample: HaplotypeCaller)
  E --> F(Load gVCFs into Genomic DB: GenomicsDBImport)
  F --> G(Create VCFs from Genomic DB: GenotypeGVCFs)

A mini data set that only takes about 1.5 minutes to run through the major steps of a GATK-like variant calling workflow

  • Heavily subsampled Chinook salmon reads.
  • Three paired-end fastqs, and data only from three or four chromosomes.
  • We will trim it, map it, mark duplicates, then make one gVCF file per individual.
  • Then, to save time, we call variants only on one chromosome: CM031202.1.

Setting up our workspaces

It should be simple to set this up. The code at right does the following:

  • cd to the Snakemake-Example data directory inside the nmfs-bioinf-2022 repo that is in your home directory.
  • Activate the snakemake-7.7.0 conda environment on Sedna.
  • Test to make sure that you have snakemake on your PATH
Paste this into your shell
# From the home directory of your ConGen server
cd ~/nmfs-bioinf-2022/Snakemake-Example
conda activate /opt/bioinformatics/miniconda3/envs/snakemake-7.7.0

# To make sure it is working, print the help information
# for snakemake
snakemake --help

Initial Configuration of our work directory

  • We can use the Unix tree utility to see what the Snakemake-Example directory contains.
  • Within the Snakemake-Example directory, type tree at the command line. This shows:
    • A README.md with installation instructions
    • A Snakefile. Much more about that later.
    • A directory data with three pairs of FASTQ files
    • A directory env that has information to install necessary software with conda
    • A directory hpcc-profiles that has some subdirectories
    • A directory resources that has two subdirectories
      • adapters: info for trimming Illumina adapters
      • genome.fasta: a FASTA file with the reference genome
--% tree
.
├── README.md
├── Snakefile
├── data
│   ├── A_R1.fastq.gz
│   ├── A_R2.fastq.gz
│   ├── B_R1.fastq.gz
│   ├── B_R2.fastq.gz
│   ├── C_R1.fastq.gz
│   └── C_R2.fastq.gz
├── env
│   └── snakemake-example.yml
├── hpcc-profiles
│   └── slurm
│       └── sedna
│           ├── config.yaml
│           └── status-sacct.sh
└── resources
    ├── adapters
    │   ├── NexteraPE-PE.fa
    │   ├── TruSeq2-PE.fa
    │   ├── TruSeq2-SE.fa
    │   ├── TruSeq3-PE-2.fa
    │   ├── TruSeq3-PE.fa
    │   └── TruSeq3-SE.fa
    └── genome.fasta

7 directories, 18 files

How would you tackle this in a Unix way?

Consider the first two “steps”

flowchart TD
  H(fastq files from 3 samples: our raw data) --> I(Trim the reads: trimmomatic)
  I --> J(Map the reads to a reference genome: bwa mem)

Some pseudo-shell code

Do not evaluate this
# cycle over fastqs and do the trimming
for S in A B C; do
  trimmomatic data/${S}_R1.fastq.gz data/S{S}_R2.fastq.gz \
    trimmed/${S}_R1.fastq.gz trimmed/${S}_R1_unpaired.fastq.gz \
    trimmed/${S}_R2.fastq.gz trimmed/${S}_R2_unpaired.fastq.gz \
    other-arguments-etc...
done 


# cycle over trimmed fastqs and do the mapping
for S in A B C; do
  bwa mem resources/genome.fasta \
    trimmed/${S}_R1.fastq.gz trimmed/${S}_R2.fastq.gz
done

What are some issues here?

  1. Ah crap! I forgot to index genome.fasta!
  2. This does not run the jobs in parallel!

Possible solutions for #2?

You can get things done in parallel using SLURM’s sbatch (which you should be using anyway).

Going about doing this with SLURM (a sketch…)

Consider the first two “steps”

flowchart TD
  H(fastq files from 3 samples: our raw data) --> I(Trim the reads: trimmomatic)
  I --> J(Map the reads to a reference genome: bwa mem)

Some pseudo-shell code

Do not evaluate this
# cycle over fastqs and dispatch each trimming job to SLURM
for S in A B C; do
  sbatch my-trim-script.sh $S
done 

# ONCE ALL THE TRIMMING IS DONE...
# cycle over trimmed fastqs and dispatch each mapping job to SLURM
for S in A B C; do
  sbatch my-map-script $S
done

What is not-so-great about this?

  1. I have to wait for all the jobs of each step to finish
  2. I have to explicitly start each “next” step.
  3. If some jobs of a step fail, it is a PITA to go back and figure out which ones failed.
  4. The dependence between the outputs of the trimming step and the mapping step are implicit based on file paths buried in the scripts, rather than explicit.

The Advantages of Snakemake

  • The dependence between input and output files is explicit
  • This lets snakemake identify every single job that must be run—and the order they must be run in—for the entire workflow (all the steps)
  • This maximizes the number of jobs that can be run at once.
  • The necessary steps are determined by starting from the ultimate outputs that are desired or requested…
  • …then working backward through the dependencies to identify which jobs must be run to eventually get the ultimate output.
  • This greatly simplifies the problem of re-running any jobs that might have failed for reasons “known only to the cluster.”

Snakemake is a program that interprets a set of rules stored in a Snakefile

Some explanations:

  • Rule blocks: the fundamental unit
  • Correspond to “steps” in the workflow
  • Keyword “rule” + name + colon
  • Indenting like Python/YAML
  • Typically includes sub-blocks of input, output, and shell
  • (Also params, log, benchmarks, conda, etc.)
Snakefile


SAMPLES = ['A', 'B', 'C']


rule genome_faidx:
  input:
    "resources/genome.fasta",
  output:
    "resources/genome.fasta.fai",
  log:
    "results/logs/genome_faidx.log",
  envmodules:
    "bio/samtools/1.15.1"
  shell:
    "samtools faidx {input} 2> {log} "


rule genome_dict:
  input:
    "resources/genome.fasta",
  output:
    "resources/genome.dict",
  log:
    "results/logs/genome_dict.log",
  envmodules:
    "bio/samtools/1.15.1"
  shell:
    "samtools dict {input} > {output} 2> {log} "


rule bwa_index:
  input:
    "resources/genome.fasta"
  output:
    multiext("resources/genome.fasta", ".amb", ".ann", ".bwt", ".pac", ".sa"),
  log:
    "results/logs/bwa_index/bwa_index.log"
  envmodules:
    "aligners/bwa/0.7.17"
  shell:
    "bwa index {input} 2> {log} "




rule trim_reads:
  input:
    r1="data/{sample}_R1.fastq.gz",
    r2="data/{sample}_R2.fastq.gz",
  output:
    r1="results/trimmed/{sample}_R1.fastq.gz",
    r1_unp="results/trimmed/{sample}_R1.unpaired.fastq.gz",
    r2="results/trimmed/{sample}_R2.fastq.gz",
    r2_unp="results/trimmed/{sample}_R2.unpaired.fastq.gz",
  log:
    "results/logs/trim_reads/{sample}.log"
  envmodules:
    "bio/trimmomatic/0.39"
  shell:
    " java -jar /opt/bioinformatics/bio/trimmomatic/trimmomatic-0.39/trimmomatic-0.39.jar  PE {input} {output} "
    "  ILLUMINACLIP:resources/adapters/TruSeq3-PE-2.fa:2:30:10  "
    "  LEADING:3  "
    "  TRAILING:3  "
    "  SLIDINGWINDOW:4:20  "
    "  MINLEN:36 2> {log} "




rule map_reads:
  input:
    r1="results/trimmed/{sample}_R1.fastq.gz",
    r2="results/trimmed/{sample}_R2.fastq.gz",
    genome="resources/genome.fasta",
    idx=multiext("resources/genome.fasta", ".amb", ".ann", ".bwt", ".pac", ".sa")
  output:
    "results/bam/{sample}.bam"
  log:
    "results/logs/map_reads/{sample}.log"
  params:
    RG="-R '@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA' "
  envmodules:
    "bio/samtools/1.15.1",
    "aligners/bwa/0.7.17"
  shell:
    " (bwa mem {params.RG} {input.genome} {input.r1} {input.r2} | "
    " samtools view -u | "
    " samtools sort - > {output}) 2> {log} "




rule mark_duplicates:
  input:
    "results/bam/{sample}.bam"
  output:
    bam="results/mkdup/{sample}.bam",
    bai="results/mkdup/{sample}.bai",
    metrics="results/qc/mkdup_metrics/{sample}.metrics"
  log:
    "results/logs/mark_duplicates/{sample}.log"
  envmodules:
    "bio/gatk/4.2.0.0"
  shell:
    " gatk MarkDuplicates  "
    "  --CREATE_INDEX "
    "  -I {input} "
    "  -O {output.bam} "
    "  -M {output.metrics} 2> {log} "




rule make_gvcfs:
  input:
    bam="results/mkdup/{sample}.bam",
    bai="results/mkdup/{sample}.bai",
    ref="resources/genome.fasta",
    idx="resources/genome.dict",
    fai="resources/genome.fasta.fai"
  output:
    gvcf="results/gvcf/{sample}.g.vcf.gz",
    idx="results/gvcf/{sample}.g.vcf.gz.tbi",
  params:
    java_opts="-Xmx4g"
  log:
    "results/logs/make_gvcfs/{sample}.log"
  envmodules:
    "bio/gatk/4.2.0.0"
  shell:
    " gatk --java-options \"{params.java_opts}\" HaplotypeCaller "
    " -R {input.ref} "
    " -I {input.bam} "
    " -O {output.gvcf} "
    " -L CM031202.1    "            # just one small-ish "chromosome" for speed
    " --native-pair-hmm-threads 1 " # this is just for this small example
    " -ERC GVCF > {log} 2> {log} "




rule import_genomics_db:
  input:
    gvcfs=expand("results/gvcf/{s}.g.vcf.gz", s=SAMPLES)
  output:
    gdb=directory("results/genomics_db/CM031202.1")
  log:
    "results/logs/import_genomics_db/log.txt"
  envmodules:
    "bio/gatk/4.2.0.0"
  shell:
    " VS=$(for i in {input.gvcfs}; do echo -V $i; done); "  # make a string like -V file1 -V file2
    " gatk --java-options \"-Xmx4g\" GenomicsDBImport "
    "  $VS  "
    "  --genomicsdb-workspace-path {output.gdb} "
    "  -L  CM031202.1 2> {log} "




rule vcf_from_gdb:
  input:
    gdb="results/genomics_db/CM031202.1",
    ref="resources/genome.fasta",
    fai="resources/genome.fasta.fai",
    idx="resources/genome.dict",
  output:
    vcf="results/vcf/all.vcf"
  log:
    "results/logs/vcf_from_gdb/log.txt"
  envmodules:
    "bio/gatk/4.2.0.0"
  shell:
    " gatk --java-options \"-Xmx4g\" GenotypeGVCFs "
    "  -R {input.ref}  "
    "  -V gendb://{input.gdb} "
    "  -O {output.vcf} 2> {log} "

A closer look at a simple rule

(Screen grab from Sublime Text which has great highlighting for Snakemake)

The rule:

  • Requires the input file resources/genome.fasta
  • Produces the output file resources/genome.dict
  • Writes to a log file in results/logs/genome_dict.log
  • Loads the environment module bio/samtools/1.15.1
  • Uses the shell code samtools dict {input} > {output} 2> {log} to get the job done
  • What are those purple bits? {input}, {output}, and {log}?! in the shell code?
  • That is the syntax snakemake uses to substitute the values in the output, input, or log blocks (or other blocks…) into the Unix shell command.
  • Big Note: Output and log information is not written automatically to the output file and log file, nor is input taken automatically from the input file—you have to dicate that behavior by what you write in the shell block!
  • Thus, when this rule runs, the shell command executed will be:
samtools dict resources/genome.fasta > resources/genome.dict 2> results/logs/genome_dict.log 

We “drive” Snakemake by requesting the creation of output files

These output files are sometimes referred to as “targets”

  • snakemake looks for and uses the Snakefile in the current working directory.
  • Option -n tells snakemake to do a “dry-run:” (Just say what you would do, but don’t do it!)
  • Option -p tells snakemake to print the shell commands of the rules.
  • Those two options can be combined: -np
  • And we request resources/genome.dict as a target by just putting it on the command line:
Paste this into your shell
snakemake -np resources/genome.dict
  • And the output you got from that should look like:
What the output should look like
Building DAG of jobs...
Job stats:
job            count    min threads    max threads
-----------  -------  -------------  -------------
genome_dict        1              1              1
total              1              1              1


[Fri Sep  2 09:34:41 2022]
rule genome_dict:
    input: resources/genome.fasta
    output: resources/genome.dict
    log: results/logs/genome_dict.log
    jobid: 0
    resources: tmpdir=/var/folders/xg/mz_qt7q54yv_hwzvhskwx2c00000gp/T

samtools dict resources/genome.fasta > resources/genome.dict 2> results/logs/genome_dict.log
Job stats:
job            count    min threads    max threads
-----------  -------  -------------  -------------
genome_dict        1              1              1
total              1              1              1

This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

Let’s actually run that!

  • Remove the -np option then add --cores 1 to tell snakemake to use one core to run the requested jobs and we have to tell Snakemake to use the software modules we specified for each rule
Paste this into your shell
snakemake --cores 1 --use-envmodules resources/genome.dict
  • The output you get looks like what you saw before, but in this case the requested output file has been created.
  • And a log capturing stderr (if any) was created:
Paste this into your shell to see all the files
tree .

The output shows those two new files that were created

Output should look like this:
.
├── README.md
├── Snakefile
├── data
│   ├── A_R1.fastq.gz
│   ├── A_R2.fastq.gz
│   ├── B_R1.fastq.gz
│   ├── B_R2.fastq.gz
│   ├── C_R1.fastq.gz
│   └── C_R2.fastq.gz
├── env
│   └── snakemake-example.yml
├── resources
│   ├── adapters
│   │   ├── NexteraPE-PE.fa
│   │   ├── TruSeq2-PE.fa
│   │   ├── TruSeq2-SE.fa
│   │   ├── TruSeq3-PE-2.fa
│   │   ├── TruSeq3-PE.fa
│   │   └── TruSeq3-SE.fa
│   ├── genome.dict           <--- THIS IS A NEW FILE!
│   └── genome.fasta
└── results
    └── logs
        └── genome_dict.log   <--- THIS IS A NEW FILE!

Once a target file is created or updated Snakemake knows it

  • If you request the file resources/genome.dict from Snakemake now, it tells you that the file is there and does not need updating.
Paste this into your shell
snakemake -np resources/genome.dict
  • Because resources/genome.dict already exists (and none of its dependencies have been updated since it was created) Snakemake tells you this:
Expected output from Snakemake
Building DAG of jobs...
Nothing to be done (all requested files are present and up to date).
  • This helps you to not remake output files that don’t need remaking!

Wildcards: How Snakemake manages replication

Wildcards allow running multiple instances of the same rule on different input files by simple pattern matching

  • If we request from Snakemake the file
    results/trimmed/A_R1.fastq.gz,
  • then, Snakemake recognizes that this matches the output of rule trim_reads with the wildcard {sample} replaced by A.
  • And Snakemake propagates the value A of the wildcard {sample} to the input block.
  • Thus Snakemake knows that to create
    results/trimmed/A_R1.fastq.gz
    it needs the input files:
    • data/A_R1.fastq.gz
    • data/A_R2.fastq.gz

Try requesting those trimmed fastq files

  • See what snakemake would do when you ask for results/trimmed/A_R1.fastq.gz.
Paste this into your shell
snakemake -np results/trimmed/A_R1.fastq.gz
  • Note that you can request files from more than one sample:
Paste this into your shell
snakemake -np results/trimmed/A_R1.fastq.gz results/trimmed/B_R1.fastq.gz results/trimmed/C_R1.fastq.gz  
  • Then, go ahead and run that last one, instructing Snakemake to use three cores
Paste this into your shell
snakemake --cores 2 --use-envmodules results/trimmed/A_R1.fastq.gz results/trimmed/B_R1.fastq.gz results/trimmed/C_R1.fastq.gz  

Note that it will go ahead and start all those jobs independently, and concurrently, because they do not depend on one another. This is how Snakemake manages and maximizes parallelism.

Chains of file dependencies

  • If Snakemake does not find a required input file for a rule that provides a requested output, it searches through the outputs of all the other rules in the Snakefile to find a rule that might provide the required input file as one of its outputs.
  • It then schedules all the necessary rules to run.
  • This means that an entire workflow with thousands of jobs can be triggered by requesting a single output file.

Short Breakout Room Activity

  • Trace the rules needed if we request the file results/vcf/all.vcf.
Snakefile


SAMPLES = ['A', 'B', 'C']


rule genome_faidx:
  input:
    "resources/genome.fasta",
  output:
    "resources/genome.fasta.fai",
  log:
    "results/logs/genome_faidx.log",
  envmodules:
    "bio/samtools/1.15.1"
  shell:
    "samtools faidx {input} 2> {log} "


rule genome_dict:
  input:
    "resources/genome.fasta",
  output:
    "resources/genome.dict",
  log:
    "results/logs/genome_dict.log",
  envmodules:
    "bio/samtools/1.15.1"
  shell:
    "samtools dict {input} > {output} 2> {log} "


rule bwa_index:
  input:
    "resources/genome.fasta"
  output:
    multiext("resources/genome.fasta", ".amb", ".ann", ".bwt", ".pac", ".sa"),
  log:
    "results/logs/bwa_index/bwa_index.log"
  envmodules:
    "aligners/bwa/0.7.17"
  shell:
    "bwa index {input} 2> {log} "




rule trim_reads:
  input:
    r1="data/{sample}_R1.fastq.gz",
    r2="data/{sample}_R2.fastq.gz",
  output:
    r1="results/trimmed/{sample}_R1.fastq.gz",
    r1_unp="results/trimmed/{sample}_R1.unpaired.fastq.gz",
    r2="results/trimmed/{sample}_R2.fastq.gz",
    r2_unp="results/trimmed/{sample}_R2.unpaired.fastq.gz",
  log:
    "results/logs/trim_reads/{sample}.log"
  envmodules:
    "bio/trimmomatic/0.39"
  shell:
    " java -jar /opt/bioinformatics/bio/trimmomatic/trimmomatic-0.39/trimmomatic-0.39.jar  PE {input} {output} "
    "  ILLUMINACLIP:resources/adapters/TruSeq3-PE-2.fa:2:30:10  "
    "  LEADING:3  "
    "  TRAILING:3  "
    "  SLIDINGWINDOW:4:20  "
    "  MINLEN:36 2> {log} "




rule map_reads:
  input:
    r1="results/trimmed/{sample}_R1.fastq.gz",
    r2="results/trimmed/{sample}_R2.fastq.gz",
    genome="resources/genome.fasta",
    idx=multiext("resources/genome.fasta", ".amb", ".ann", ".bwt", ".pac", ".sa")
  output:
    "results/bam/{sample}.bam"
  log:
    "results/logs/map_reads/{sample}.log"
  params:
    RG="-R '@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA' "
  envmodules:
    "bio/samtools/1.15.1",
    "aligners/bwa/0.7.17"
  shell:
    " (bwa mem {params.RG} {input.genome} {input.r1} {input.r2} | "
    " samtools view -u | "
    " samtools sort - > {output}) 2> {log} "




rule mark_duplicates:
  input:
    "results/bam/{sample}.bam"
  output:
    bam="results/mkdup/{sample}.bam",
    bai="results/mkdup/{sample}.bai",
    metrics="results/qc/mkdup_metrics/{sample}.metrics"
  log:
    "results/logs/mark_duplicates/{sample}.log"
  envmodules:
    "bio/gatk/4.2.0.0"
  shell:
    " gatk MarkDuplicates  "
    "  --CREATE_INDEX "
    "  -I {input} "
    "  -O {output.bam} "
    "  -M {output.metrics} 2> {log} "




rule make_gvcfs:
  input:
    bam="results/mkdup/{sample}.bam",
    bai="results/mkdup/{sample}.bai",
    ref="resources/genome.fasta",
    idx="resources/genome.dict",
    fai="resources/genome.fasta.fai"
  output:
    gvcf="results/gvcf/{sample}.g.vcf.gz",
    idx="results/gvcf/{sample}.g.vcf.gz.tbi",
  params:
    java_opts="-Xmx4g"
  log:
    "results/logs/make_gvcfs/{sample}.log"
  envmodules:
    "bio/gatk/4.2.0.0"
  shell:
    " gatk --java-options \"{params.java_opts}\" HaplotypeCaller "
    " -R {input.ref} "
    " -I {input.bam} "
    " -O {output.gvcf} "
    " -L CM031202.1    "            # just one small-ish "chromosome" for speed
    " --native-pair-hmm-threads 1 " # this is just for this small example
    " -ERC GVCF > {log} 2> {log} "




rule import_genomics_db:
  input:
    gvcfs=expand("results/gvcf/{s}.g.vcf.gz", s=SAMPLES)
  output:
    gdb=directory("results/genomics_db/CM031202.1")
  log:
    "results/logs/import_genomics_db/log.txt"
  envmodules:
    "bio/gatk/4.2.0.0"
  shell:
    " VS=$(for i in {input.gvcfs}; do echo -V $i; done); "  # make a string like -V file1 -V file2
    " gatk --java-options \"-Xmx4g\" GenomicsDBImport "
    "  $VS  "
    "  --genomicsdb-workspace-path {output.gdb} "
    "  -L  CM031202.1 2> {log} "




rule vcf_from_gdb:
  input:
    gdb="results/genomics_db/CM031202.1",
    ref="resources/genome.fasta",
    fai="resources/genome.fasta.fai",
    idx="resources/genome.dict",
  output:
    vcf="results/vcf/all.vcf"
  log:
    "results/logs/vcf_from_gdb/log.txt"
  envmodules:
    "bio/gatk/4.2.0.0"
  shell:
    " gatk --java-options \"-Xmx4g\" GenotypeGVCFs "
    "  -R {input.ref}  "
    "  -V gendb://{input.gdb} "
    "  -O {output.vcf} 2> {log} "

Helpful notes:

  • expand("results/gvcf/{s}.g.vcf.gz", s=SAMPLES) is a list of files:
[results/gvcf/A.g.vcf.gz, results/gvcf/B.g.vcf.gz, results/gvcf/C.g.vcf.gz]
  • multiext("resources/genome.fasta", ".amb", ".ann", ".bwt", ".pac", ".sa") is a list of files:
[resources/genome.fasta.amb, resources/genome.fasta.ann, resources/genome.fasta.bwt, resources/genome.fasta.pac, resources/genome.fasta.sa]

Let’s request results/vcf/all.vcf from Snakemake

  • Let’s start with a dry run:
Paste this into your shell
snakemake -np results/vcf/all.vcf  
  • After we look at that, and discuss, let’s actually run it, using 2 cores:
Paste this into your shell
snakemake -p --cores 2 --use-envmodules results/vcf/all.vcf  

That should take a minute or two.

  • If you try to run the workflow again, Snakemake tells you that you do not need to, because everything is up to date: Try running the above line again:
Paste this into your shell
snakemake -p --cores 2 --use-envmodules results/vcf/all.vcf  

Note that we are using --cores 2 here because we have only checked out two cores for ourselves. If we had checked out 20, we could do --cores 20 and have it launch 20 jobs at a time (if they were ready to go).

If any inputs change, Snakemake will re-run the rules that depend on the new input

  • Imagine that the sequencing center calls us to say that there has been a terrible mistake and they are sending you new (and correct) versions of data for sample C: C_R1.fastq.gz and C_R2.fastq.gz
  • Snakemake uses file modification dates to check if any inputs have been updated after target outputs have been created.
  • So we can simulate new fastq files for sample C by using the touch command to update the fastq file modification dates:
Paste this into your shell
touch data/C_R1.fastq.gz data/C_R2.fastq.gz
  • Now, when we run Snakemake again, it tells us we have to run more jobs, but only the ones that depend on data from sample C. Do a dry run to check that:
Paste this into your shell
snakemake -np results/vcf/all.vcf
  • Check that it will not re-run the trimming, mapping, and gvcf-making steps for samples A and B, which are aleady done.

Snakemake makes it very easy to re-run failed jobs

  • Clusters and computers fail (sometimes for no apparent reason) occasionally
  • If this happens in a large, traditionally managed (Unix script) workflow, finding and re-running the failures can be hard.
  • Example: 7 samples out of 192 fail on HaplotypeCaller because those jobs got sent to nodes without AVX acceleration.
  • Five years ago, setting up custom scripts to re-run just those 7 samples could cost me an hour—about as much time as it takes me now to set up an entire workflow with Snakemake.
  • On the next slide we are going to create a job failure to see how easy it is to re-run jobs that failed with Snakemake.

Simulating a job failure as an example

  • First, let’s remove the entire results directory and a few files we created in our resources directory so that we have to re-run all of our workflow.
Paste this into your shell
rm -rf results resources/genome.fasta.* resources/genome.dict
  • Now, we are going to corrupt the read-2 fastq file for sample A (but keeping a copy of the original)
Paste this into your shell
cp data/A_R2.fastq.gz data/A_R2.fastq.gz-ORIG
echo "GARBAGE_DATA" | gzip -c > data/A_R2.fastq.gz
  • Now, do a dry-run, requesting results/vcf/all.vcf
Paste this into your shell
snakemake -np results/vcf/all.vcf

The output ends telling us that 17 jobs will be run:

End of the expected output
Job stats:
job                   count    min threads    max threads
------------------  -------  -------------  -------------
bwa_index                 1              1              1
genome_dict               1              1              1
genome_faidx              1              1              1
import_genomics_db        1              1              1
make_gvcfs                3              1              1
map_reads                 3              1              1
mark_duplicates           3              1              1
trim_reads                3              1              1
vcf_from_gdb              1              1              1
total                    17              1              1
  • Now, run it with 2 cores and give it the --keep-going command which means that even if an error occurs on one job, all the other jobs that don’t depend on outputs from the failed job will still get run.
Paste this into your shell
snakemake --cores 2 --use-envmodules --keep-going results/vcf/all.vcf
  • Snakemake tells us that 8 of the 14 jobs were successful but at least one job failed:
Snakemake's concluding comments:
11 of 17 steps (65%) done
Exiting because a job execution failed. Look above for error message
BUG: Out of jobs ready to be started, but not all files built yet. Please check https://github.com/snakemake/snakemake/issues/823 for more information.
Remaining jobs:
 - make_gvcfs: results/gvcf/A.g.vcf.gz, results/gvcf/A.g.vcf.gz.tbi
 - mark_duplicates: results/mkdup/A.bam, results/mkdup/A.bai, results/qc/mkdup_metrics/A.metrics
 - trim_reads: results/trimmed/A_R1.fastq.gz, results/trimmed/A_R1.unpaired.fastq.gz, results/trimmed/A_R2.fastq.gz, results/trimmed/A_R2.unpaired.fastq.gz
 - vcf_from_gdb: results/vcf/all.vcf
 - import_genomics_db: results/genomics_db/CM031202.1
 - map_reads: results/bam/A.bam

Cool! It tells us explicitly which jobs remain to be run. And they are exactly the ones that depend on outputs from sample A.

Re-running failed jobs is as simple as just re-starting Snakemake

  • Let’s say we notice that data/A_R2.fastq.gz was corrupted, and so we replace it with the uncorrupted version:
Paste this into your shell
cp data/A_R2.fastq.gz-ORIG data/A_R2.fastq.gz
  • Then, do a dry-run to see what Snakemake will do to finish out the workflow:
Paste this into your shell
snakemake -np results/vcf/all.vcf

It is only going to require 6 jobs to produce results/vcf/all.vcf:

This is the end of the dry-run output
Job stats:
job                   count    min threads    max threads
------------------  -------  -------------  -------------
import_genomics_db        1              1              1
make_gvcfs                1              1              1
map_reads                 1              1              1
mark_duplicates           1              1              1
trim_reads                1              1              1
vcf_from_gdb              1              1              1
total                     6              1              1
  • So, start it up with 2 cores:
Paste this into your shell
snakemake --cores 2 --use-envmodules results/vcf/all.vcf

Now that sample A is not corrupted, it finishes. Yay! That was easy.

Snakemake encourages (requires?) that your outputs all reside in a consistent directory structure

(And a side note: Snakemake automatically creates all the directories needed to store its output files)

  • Check out all the outputs of our workflow in an easy-to-understand directory structure within results:
Paste this into your shell
# only drill down three directory levels (-L 3)
tree -L 3 results

Here is what the result looks like:

The tree listing of the full results of the workflow
results
├── bam
│   ├── A.bam
│   ├── B.bam
│   └── C.bam
├── genomics_db
│   └── CM031202.1
│       ├── CM031202.1$1$6000000
│       ├── __tiledb_workspace.tdb
│       ├── callset.json
│       ├── vcfheader.vcf
│       └── vidmap.json
├── gvcf
│   ├── A.g.vcf.gz
│   ├── A.g.vcf.gz.tbi
│   ├── B.g.vcf.gz
│   ├── B.g.vcf.gz.tbi
│   ├── C.g.vcf.gz
│   └── C.g.vcf.gz.tbi
├── logs
│   ├── bwa_index
│   │   └── bwa_index.log
│   ├── genome_dict.log
│   ├── genome_faidx.log
│   ├── import_genomics_db
│   │   └── log.txt
│   ├── make_gvcfs
│   │   ├── A.log
│   │   ├── B.log
│   │   └── C.log
│   ├── map_reads
│   │   ├── A.log
│   │   ├── B.log
│   │   └── C.log
│   ├── mark_duplicates
│   │   ├── A.log
│   │   ├── B.log
│   │   └── C.log
│   ├── trim_reads
│   │   ├── A.log
│   │   ├── B.log
│   │   └── C.log
│   └── vcf_from_gdb
│       └── log.txt
├── mkdup
│   ├── A.bai
│   ├── A.bam
│   ├── B.bai
│   ├── B.bam
│   ├── C.bai
│   └── C.bam
├── qc
│   └── mkdup_metrics
│       ├── A.metrics
│       ├── B.metrics
│       └── C.metrics
├── trimmed
│   ├── A_R1.fastq.gz
│   ├── A_R1.unpaired.fastq.gz
│   ├── A_R2.fastq.gz
│   ├── A_R2.unpaired.fastq.gz
│   ├── B_R1.fastq.gz
│   ├── B_R1.unpaired.fastq.gz
│   ├── B_R2.fastq.gz
│   ├── B_R2.unpaired.fastq.gz
│   ├── C_R1.fastq.gz
│   ├── C_R1.unpaired.fastq.gz
│   ├── C_R2.fastq.gz
│   └── C_R2.unpaired.fastq.gz
└── vcf
    ├── all.vcf
    └── all.vcf.idx

Snakemake eye-candy—visualizing the workflow dependencies

Using the --dag option, like this:

Paste this into your shell
snakemake --dag results/vcf/all.vcf | dot -Tsvg > dag.svg

Makes a directed acyclic graph (DAG) of the workflow. If you view it, it looks like this:

Snakemake eye-candy—filegraphs

Using the --filegraph option, like this:

Paste this into your shell
snakemake --filegraph results/vcf/all.vcf | dot -Tsvg > filegraph.svg

Makes a graph (DAG) of the files involved in the workflow. If you view it, it looks like this:

Letting Snakemake interface with SLURM

The way to most easily allow Snakemake to dispatch jobs via the SLURM scheduler is by way of the Snakemake cluster option provided in a Snakemake profile.

A Snakemake profile is YAML file in which you can record command line options (and their arguments) for Snakemake.

There is an officially supported Snakemake profile for SLURM, but I am partial to the Unix-based (as opposed to Python-based) approach to SLURM profiles for Snakemake described at: https://github.com/jdblischak/smk-simple-slurm.

Snakemake profile for Sedna

I have stored a Snakemake profile for Sedna in our repository. It is in the directory hpcc-profiles/slurm/sedna.

The meat of the profile is in the config.yaml file:

Contents of hpcc-profiles/slurm/sedna/config.yaml
cluster:
  mkdir -p results/slurm_logs/{rule} &&
  sbatch
    --exclude=node[29-36]
    --cpus-per-task={threads}
    --mem={resources.mem_mb}
    --time={resources.time}
    --job-name=smk-{rule}-{wildcards}
    --output=results/slurm_logs/{rule}/{rule}-{wildcards}-%j.out
    --error=results/slurm_logs/{rule}/{rule}-{wildcards}-%j.err
    --parsable
default-resources:
  - time="08:00:00"
restart-times: 0
max-jobs-per-second: 10
max-status-checks-per-second: 50
local-cores: 1
latency-wait: 60
cores: 600
jobs: 1200
keep-going: True
rerun-incomplete: True
printshellcmds: True
use-envmodules: True
cluster-status: status-sacct.sh
cluster-cancel: scancel
cluster-cancel-nargs: 1000

We will talk about it a little bit…

Another important part of the profile is a shell script that lets Snakemake check the status of jobs using the sacct command. It is in hpcc-profiles/slurm/sedna/status-sacct.sh.

Using the Snakemake profile to submit jobs via SLURM on Sedna

Here is how I do it:

  1. Get an interactive shell on a compute node with 2 cores (we have already done that). Check it out for three or four times longer than you think your workflow will run.
  2. Invoke Snakemake with the --profile option, the argument to which is the path hpcc-profiles/slurm/sedna (in this case, because that is where the Snakemake profile for Sedna is located)

Then, Snakemake takes care of all the interaction with SLURM for you.

So, first, let us remove all the output files (so Snakemake will have to re-make them) and do a dry run:

Paste this into your shell
# get rid of previous results
rm -rf results  resources/genome.fasta.* resources/genome.dict

# do a dry run:
snakemake -np --profile hpcc-profiles/slurm/sedna results/vcf/all.vcf

Now, let’s run the whole workflow, dipatching it through SLURM’s sbatch function.

Paste this into your shell
# if that looks good, do the run for real
snakemake -np --profile hpcc-profiles/slurm/sedna results/vcf/all.vcf

We’ve only scratched the surface

  • You can specify fine-grained conda environments for each rule
  • Python code is allowed in most places in the Snakefile
  • Input functions can be quite useful (or absolutely essential)
  • You can benchmark every job instance of a rule, which records the resources used (time, memory, etc.)
  • Pretty easy integration (once you get past the somewhat scattered documentation) with SLURM on your cluster. (I recommend checking out the resources at https://github.com/jdblischak/smk-simple-slurm)

Where to from here?

You might be interested in having a look at a workflow I wrote for whole genome sequencing of non-model organisms: https://github.com/eriqande/mega-non-model-wgs-snakeflow.

This provides a complete BWA-GATK workflow.

Final Thoughts

  • Learning snakemake may require a bit of an investment, BUT…
  • For anyone doing a lot of bioinformatic processing of sequence data it is quite a sound investment.